Importing the data

Merging white and red dfs
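A minimal sketch of the merge step: the two small dataframes below are hypothetical stand-ins for the red and white wine dataframes (in the notebook these would come from something like `pd.read_csv("winequality-red.csv", sep=";")`). Tagging each frame with its wine type before concatenating preserves the origin of every row.

```python
import pandas as pd

# Hypothetical stand-ins for the real red and white wine dataframes.
red = pd.DataFrame({"fixed acidity": [7.4, 7.8], "quality": [5, 5]})
white = pd.DataFrame({"fixed acidity": [7.0, 6.3], "quality": [6, 6]})

# Tag each frame with its type before concatenating, so the origin
# of every row survives the merge.
red["type"] = "red"
white["type"] = "white"
df = pd.concat([red, white], ignore_index=True)
```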

The output labels range from 3 to 9. According to the description originally given with the dataset, the labels should have contained values from 0 to 10.

The label 9 has only 5 occurrences

The label 3 has 30 occurrences

Merging some labels

The label 9 has so few occurrences (only 5 of them) that it is impossible to use it reliably in a prediction, therefore I opt to merge the labels 8 and 9 into a single class meaning "8 or higher".
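One way to sketch this merge, assuming the labels live in a column like `df["quality"]` (the series below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical quality labels; in the notebook this would be df["quality"].
quality = pd.Series([3, 5, 8, 9, 9, 6])

# Collapse the rare label 9 into 8, so the top class reads "8 or higher".
merged = quality.clip(upper=8)
```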

One hot encoding the type of wine
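A possible way to one-hot encode the wine type with pandas (the column name `type` is assumed here):

```python
import pandas as pd

df = pd.DataFrame({"type": ["red", "white", "white"], "quality": [5, 6, 7]})

# get_dummies replaces the categorical column with one 0/1 column per category.
df = pd.get_dummies(df, columns=["type"])
```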

t-SNE projection of the dataset onto two axes

t-SNE transforms a high-dimensional dataset into two dimensions while trying to preserve the notion of distance from the high-dimensional space: if two points were far apart in the high-dimensional representation, they should also be far apart in the t-SNE projection.

In this case, this high dimensional space doesn't seem to be particularly informative for the classification task.
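The projection itself can be obtained with scikit-learn; the random matrix below is a stand-in for the wine feature matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))  # stand-in for the wine features

# Project into two dimensions; perplexity must be smaller than n_samples.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```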

Dividing into test and training sets

We stratify so that we train on data whose label distribution matches that of the data on which the model will have to predict.
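A stratified split can be sketched as follows (the toy arrays stand in for the wine features and quality labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)  # imbalanced labels

# stratify=y keeps the label proportions identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```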

SVM

Hard margin SVM

Suppose we have a training set $S = (\textbf{x}_1,y_1),\ldots,(\textbf{x}_m,y_m)$ where $\textbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. We say that this set is linearly separable if there is a halfspace (one of the two parts into which a hyperplane divides the affine space), defined by some $(\textbf{w},b)$, such that

$$ \forall i \in [m], \quad y_i(\langle \textbf{w},\textbf{x}_i \rangle + b) > 0 $$

Furthermore, the distance between a point $\textbf{x}$ and the hyperplane defined by $(\textbf{w},b)$, where $\|\textbf{w}\| = 1$, is $|\langle \textbf{w},\textbf{x}\rangle + b|$.

The purpose of the SVM is to find the hyperplane that maximizes the distance between itself and the closest points. Intuitively, this is so that the resulting hyperplane does a better job at generalizing the classification results. This can be formalized in the following way:

$$ \underset{(\textbf{w},b):\|\textbf{w}\|=1}{\mathrm{argmax}} \min_{i \in [m]} y_i (\langle \textbf{w},\textbf{x}_i \rangle + b) $$

Another equivalent formulation of the problem could be

$$ (\textbf{w}_0,b_0) = \underset{(\textbf{w},b)}{\mathrm{argmin}} \ \| \textbf{w} \| ^ 2 \\ \text{s.t.} \\ \forall i, \ y_i (\langle \textbf{w},\textbf{x}_i \rangle + b) \geq 1 $$

Output: $$ \hat{\textbf{w}} = \frac{\textbf{w}_0}{\| \textbf{w}_0 \|}, \quad \hat{b} = \frac{b_0}{\|\textbf{w}_0\|} $$

We enforce the correct separation of the labeled points in the constraints under s.t. We fix the margin to 1 but measure its width through the norm of $\textbf{w}$; therefore finding the largest margin boils down to finding the minimum of $\|\textbf{w}\|$.

Soft Margin SVM

The input is $(\textbf{x}_1,y_1) ... (\textbf{x}_m,y_m)$

$$ \min_{\textbf{w},b,\boldsymbol{\epsilon}} \left( \lambda \| \textbf{w} \| ^ 2 + \frac{1}{m}\sum_{i=1}^m \epsilon_i\right) \\ \text{s.t.} \\ \forall i, \ y_i (\langle \textbf{w},\textbf{x}_i \rangle + b) \geq 1 - \epsilon_i \ \text{and} \ \epsilon_i \geq 0 $$

Output: $\textbf{w}, b$

The hard margin SVM assumes that all of the points in the dataset can be linearly separated. This is a very strong assumption that can be relaxed. We add the slack variables $\epsilon_i$, which regulate how large the mistakes are allowed to be, and the parameter $\lambda$ controls the trade-off between the width of the margin and the error made when classifying.
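A soft-margin SVM can be fit with scikit-learn's `SVC` on a toy two-cluster dataset. Note that scikit-learn parameterizes the trade-off with `C`, which plays roughly the role of $1/(\lambda m)$ in the formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Two Gaussian clusters standing in for a nearly separable dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# A small C widens the margin and tolerates more slack;
# a large C penalizes margin violations harder.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
```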

Kernel methods and the Kernel trick

Sometimes our dataset might not be linearly separable, but it might be separable in a non-linear space. The solution could be to map our original points into a non-linear space and then perform an SVM classification in the new space. The problem here is the mapping: mapping points into a non-linear space and then mapping them back into the original space might be too computationally intensive, therefore we use kernels in order to perform classification in a non-linear space.

Given some embedding of the data $\psi(\textbf{x})$ into some Hilbert space, we define the kernel as the inner product of points in the new mapping: $$ K(\textbf{x},\textbf{x}') = \langle \psi(\textbf{x}), \psi(\textbf{x}') \rangle $$

This function can be thought of as specifying a similarity between two points. It turns out that we can carry out the SVM classification using only the value of the kernel function on every pair of points in the dataset.
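This "pairwise values only" property can be illustrated with scikit-learn's precomputed-kernel interface: the classifier is trained on the Gram matrix $K_{ij} = K(\textbf{x}_i,\textbf{x}_j)$ alone, never on an explicit feature mapping. The RBF kernel and toy data here are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(2, 1, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

# The Gram matrix K[i, j] = K(x_i, x_j) is all the optimizer needs:
# the points are never explicitly mapped into the feature space.
K = rbf_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, y)
pred = clf.predict(rbf_kernel(X, X))
```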

Our svm problem can be rewritten in the following equivalent form:

$$ \min_{w} (f(\langle \textbf{w} , \psi(\textbf{x}_1)\rangle , ... , \langle \textbf{w} , \psi(\textbf{x}_m)\rangle) + R (\| \textbf{w} \|)) $$

where $f$ is an arbitrary function and $R$ is a monotonically non-decreasing function.

The representer theorem

Thanks to the representer theorem, given that $\psi$ maps into a Hilbert space, there is an optimal solution to the previous problem of the form:

$$ \textbf{w} = \sum_{i=1}^m \alpha_i \psi(\textbf{x}_i) $$

Thanks to this theorem we can rewrite the minimization function in the following way

$$ \langle \textbf{w} , \psi(\textbf{x}_i) \rangle = \left\langle \sum_j \alpha_j \psi(\textbf{x}_j), \psi(\textbf{x}_i) \right\rangle = \sum_j \alpha_j \langle \psi(\textbf{x}_j), \psi(\textbf{x}_i) \rangle $$

$$ \| \textbf{w} \|^2 = \left\langle \sum_j \alpha_j \psi(\textbf{x}_j), \sum_j \alpha_j \psi(\textbf{x}_j) \right\rangle = \sum_{i,j=1}^m \alpha_i \alpha_j \langle \psi(\textbf{x}_i) , \psi(\textbf{x}_j) \rangle $$

And therefore the original problem can be rewritten in this way:

$$ \min_{\alpha \in \mathbb{R}^m} f \left( \sum_{j=1}^m \alpha_j K(\textbf{x}_j,\textbf{x}_1) ,\ldots, \sum_{j=1}^m \alpha_j K(\textbf{x}_j,\textbf{x}_m) \right) + R \left( \sqrt{\sum_{i,j=1}^m \alpha_i \alpha_j K(\textbf{x}_j,\textbf{x}_i) } \right) $$

Defining function for plotting the confusion matrix and accuracy

Random Forests

Out of bag error

Instead of using the cross-validation technique, here we use the out-of-bag error. In this case we can fit the model on the whole dataset, since the "testing" is done on the out-of-bag samples of each individual tree, and the average is then computed.

The times are almost the same; however, the number of estimators in the OOB version is 200, while the number of estimators in the cross-validation version is 100.
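The OOB estimate can be sketched as follows, using a synthetic dataset in place of the wine data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True evaluates each tree on the samples its bootstrap left out,
# so the whole dataset can be used for fitting with no held-out fold.
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X, y)
print(rf.oob_score_)
```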

Logistic regression

SMOTE -> TODO after all the comparison between the algorithms

There is a problem with this division: some classes have so few elements that it is impossible to perform any sort of meaningful cross-validation. Therefore we need to increase the number of samples in these under-represented labels.
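Pending the SMOTE experiment noted above, a simpler stand-in is plain random oversampling with replacement, which can be sketched with `sklearn.utils.resample` (the toy dataframe and the class labels 5 and 8 are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(12), "quality": [5] * 9 + [8] * 3})

# Upsample the minority class with replacement until it matches the majority.
majority = df[df["quality"] == 5]
minority = df[df["quality"] == 8]
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, upsampled], ignore_index=True)
```

Unlike SMOTE, this duplicates existing rows rather than synthesizing new ones, but it is enough to make stratified cross-validation feasible on the rare classes.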